take nearly four times longer than FP model training. Such slow training inevitably hinders the industrial adoption of language models. Second, conducting QAT on memory-limited devices is sometimes infeasible due to the increasing size of large language models. As demonstrated in [5], the QAT method [285] even consumes 8.3 GB more memory than FP training when combined with knowledge distillation. In contrast, PTQ methods can conduct quantization by caching only the intermediate results of each layer, which can be fed into memory-limited training devices. Third, the training set is sometimes inaccessible due to industrial data security or privacy concerns, whereas PTQ constructs a small calibration set by sampling only 1K–4K instances from the whole training set.

In summary, PTQ is an appealing and efficient alternative in terms of training time, memory overhead, and data consumption. Instead of the whole training set, PTQ methods generally leverage only a small portion of the training data to minimize the layer-wise reconstruction error incurred by quantization [101, 179, 180]. The layer-wise objective decomposes end-to-end training, solving the quantization optimization problem in a more sample-efficient [297] and memory-saving way. Nonetheless, directly applying previous PTQ methods to language models such as BERT [54] is non-trivial, as the performance drops sharply. For this reason, several efforts have been made to improve performance.
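To make the layer-wise objective more concrete, below is a minimal PyTorch sketch, not the exact procedure of [101, 179, 180], that calibrates a single linear layer: it caches the full-precision layer's outputs on a small calibration batch and tunes a quantized copy to minimize the reconstruction error. The fake_quantize helper, bit-width, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform fake quantization (an illustrative choice of quantizer)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max().clamp_min(1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def calibrate_layer(fp_layer: nn.Linear, calib_inputs: torch.Tensor,
                    num_bits: int = 4, steps: int = 200, lr: float = 1e-4) -> nn.Linear:
    """Tune a quantized copy of one layer to match the FP layer's outputs."""
    q_layer = nn.Linear(fp_layer.in_features, fp_layer.out_features)
    q_layer.load_state_dict(fp_layer.state_dict())
    optimizer = torch.optim.Adam(q_layer.parameters(), lr=lr)

    with torch.no_grad():
        target = fp_layer(calib_inputs)            # cached FP intermediate results

    for _ in range(steps):
        w_fq = fake_quantize(q_layer.weight, num_bits)
        # straight-through estimator: quantized weights in the forward pass,
        # gradients flow to the latent full-precision weights in the backward pass
        w_q = q_layer.weight + (w_fq - q_layer.weight).detach()
        out = F.linear(calib_inputs, w_q, q_layer.bias)
        loss = F.mse_loss(out, target)             # layer-wise reconstruction error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return q_layer
```

Because each layer is calibrated independently against its cached inputs and outputs, only one layer's activations need to reside in memory at a time, which is what keeps the data and memory requirements of PTQ low.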

5.1.3 Binary BERT Pre-Trained Models

Recent pre-trained BERT models have advanced the state-of-the-art performance on various natural language tasks [227, 55]. Nevertheless, deploying BERT models on resource-constrained edge devices is challenging due to their massive numbers of parameters and floating-point operations (FLOPs), which limits the application of pre-trained BERT models. To mitigate this, model compression techniques have been widely studied and applied for deploying BERT models in resource-constrained and real-time scenarios, including knowledge distillation [206, 217, 106], parameter pruning [172, 64], low-rank approximation [166, 126], weight sharing [50, 126, 98], dynamic networks with adaptive depth and/or width [89, 255], and quantization [280, 208, 65, 285].

Among all these model compression approaches, quantization, which uses lower bit-width representations for model parameters, emerges as an efficient way to deploy compact BERT models on edge devices. Theoretically, it compresses the model by replacing each 32-bit floating-point parameter with a low-bit fixed-point representation. Existing attempts quantize pre-trained BERT [280, 208, 65] to values as low as ternary (2-bit) with only a minor performance drop [285]. More aggressively, binarizing the weights and activations of BERT [6, 195, 222, 156, 40] can bring up to a 32× reduction in model size and replace most floating-point multiplications with additions, which significantly alleviates the heavy parameter and FLOPs burden.
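For intuition on where the claimed 32× saving comes from, the following sketch binarizes a weight tensor to ±1 values with a single per-tensor scaling factor, a common construction in the binarization literature used here only as an illustration (the helper name is hypothetical), and compares the resulting storage footprint.

```python
import torch

def binarize_weights(w: torch.Tensor):
    """Binarize an FP32 weight tensor to {-1, +1} plus one FP scaling factor.

    Using alpha * sign(w) with alpha = mean(|w|) is a common way to keep the
    binarized tensor close to the original one (illustrative, not a specific
    published method from the text above).
    """
    alpha = w.abs().mean()        # per-tensor scaling factor
    w_bin = torch.sign(w)
    w_bin[w_bin == 0] = 1.0       # map sign(0) to +1 so all entries are in {-1, +1}
    return alpha, w_bin

w = torch.randn(768, 768)         # e.g., one attention projection matrix in BERT-base
alpha, w_bin = binarize_weights(w)

fp32_bits = w.numel() * 32                    # original storage
bin_bits = w.numel() * 1 + 32                 # 1 bit per weight + one FP32 scale
print(f"compression ratio: {fp32_bits / bin_bits:.1f}x")   # approaches 32x
```

In an actual binarized BERT the ±1 values would additionally be bit-packed and most floating-point multiplications replaced by cheaper operations such as additions, as noted above; the snippet only shows why the storage saving approaches 32×.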

Network binarization was first proposed in [48] and has been extensively studied in academia [199, 99, 159]. For BERT binarization, a general workflow is to binarize the representations of the BERT architecture in the forward propagation and apply distillation to the optimization in the backward propagation. In detail, the forward and backward propagation of the sign function in a binarized network can be formulated as:

\begin{equation}
\text{Forward: } \operatorname{sign}(x) =
\begin{cases}
1 & \text{if } x \geq 0 \\
-1 & \text{otherwise},
\end{cases}
\tag{5.1}
\end{equation}
\begin{equation}
\text{Backward: } \frac{\partial C}{\partial x} =
\begin{cases}
\dfrac{\partial C}{\partial \operatorname{sign}(x)} & \text{if } |x| \leq 1 \\
0 & \text{otherwise},
\end{cases}
\tag{5.2}
\end{equation}

where x is the input and C is the cost function for the minibatch. The sign(·) function is applied in the forward propagation, while the straight-through estimator (STE) [9] is used to obtain the